
    UpStream: storage-centric load management for streaming applications with update semantics

    This paper addresses the problem of minimizing the staleness of query results for streaming applications with update semantics under overload conditions. Staleness is a measure of how out-of-date the results are compared with the latest data arriving on the input. Real-time streaming applications are subject to overload due to unpredictably increasing data rates, while in many of them we observe that data streams and queries in fact exhibit "update semantics" (i.e., the latest input data is all that really matters when producing a query result). Under such semantics, overload causes staleness to build up. The key to avoiding this is to exploit the update semantics of applications as early as possible in the processing pipeline. In this paper, we propose UpStream, a storage-centric framework for load management over streaming applications with update semantics. We first describe how we model streams and queries that possess update semantics, providing definitions of correctness and staleness for the query results. Then, we show how staleness can be minimized through intelligent update-key scheduling techniques applied at the queue level, while preserving the correctness of the results, even for complex queries that involve sliding windows. UpStream is based on the simple idea of applying updates in place, yet yields great returns in terms of lowering staleness and memory consumption, as we experimentally verify on the Borealis system.
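    To make the update-in-place idea concrete, here is a minimal Python sketch of a queue that keeps only the freshest tuple per update key, so an overloaded consumer never wastes work on superseded values. The class and method names are hypothetical illustrations, not the UpStream/Borealis API.

```python
import threading
import time

class UpdateQueue:
    """Hypothetical update-in-place queue: keeps only the latest tuple per
    update key, so a slow consumer always processes the freshest value."""

    def __init__(self):
        self._latest = {}    # update key -> (newest value, arrival time)
        self._order = []     # keys awaiting processing, oldest first
        self._lock = threading.Lock()

    def push(self, key, value):
        with self._lock:
            if key not in self._latest:
                self._order.append(key)              # first sighting of this key
            self._latest[key] = (value, time.time()) # overwrite in place

    def pop(self):
        with self._lock:
            if not self._order:
                return None
            key = self._order.pop(0)
            value, arrived = self._latest.pop(key)
            # Report how long the freshest value waited (a proxy for staleness).
            return key, value, time.time() - arrived
```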

    Event Detection on Twitter

    Detecting events using social media has been an active research problem. In this work, we investigate and compare the performance of two methods for event detection in Twitter, using Apache Storm as the stream processing infrastructure. The first method is based on identifying uncommonly common words inside tweet blocks, and the second is based on clustering tweets and detecting a cluster as an event. Each method has its own characteristics. The uncommonly-common-word-based method relies on bursts of words and hence is not affected by concurrency problems in a distributed environment. The clustering-based method, on the other hand, performs a finer-grained analysis but is sensitive to concurrent processing. We investigate the effect of the stream processing and concurrency handling support provided by Apache Storm on event detection with these methods.
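    As a rough illustration of the first method, the sketch below flags "uncommonly common" words by comparing each word's rate in the current tweet block against a historical background rate. The thresholds, smoothing, and parameter names are assumptions made for the example, not the paper's settings.

```python
from collections import Counter

def uncommonly_common(block_tweets, background_counts, total_background,
                      min_ratio=5.0, min_count=10):
    """Return words that are frequent in the current tweet block but rare
    in the historical background, ranked by burstiness."""
    block_counts = Counter(w for t in block_tweets for w in t.lower().split())
    block_total = sum(block_counts.values()) or 1
    bursts = []
    for word, count in block_counts.items():
        if count < min_count:
            continue  # ignore words too rare in the block to call a burst
        block_rate = count / block_total
        # Add-one smoothing so unseen background words do not divide by zero.
        bg_rate = (background_counts.get(word, 0) + 1) / (total_background + 1)
        if block_rate / bg_rate >= min_ratio:
            bursts.append((word, block_rate / bg_rate))
    return sorted(bursts, key=lambda x: -x[1])
```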

    Efficiently correlating complex events over live and archived data streams

    Correlating complex events over live and archived data streams, which we call Pattern Correlation Queries (PCQs), provides many benefits for domains that need real-time forecasting of events or identification of causal dependencies while handling data at high rates and in massive amounts, as in financial or medical settings. Existing work has focused either on complex event processing over a single type of stream source (i.e., either live or archived), or on simple stream correlation queries (e.g., live events triggering a database lookup). In this paper, we specifically focus on recency-based PCQs and provide clear, useful, and optimizable semantics for them. PCQs raise a number of challenges in optimizing data management and query processing, which we address in the setting of the DejaVu complex event processing system. More specifically, we propose three complementary optimizations: recent input buffering, query result caching, and join source ordering. Furthermore, we capture the relevant query processing tradeoffs in a cost model. An extensive performance study on synthetic and real-life data sets not only validates this cost model, but also shows that our optimizations are very effective, achieving more than two orders of magnitude throughput improvement and much better scalability compared with a conventional approach.
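    Of the three optimizations, query result caching is the easiest to sketch: a small LRU cache lets repeated live matches on the same correlation key reuse an earlier archive lookup instead of re-running the archive query. This is an illustrative sketch under assumed semantics, not DejaVu's implementation; `archive_query` is an assumed callable.

```python
from collections import OrderedDict

class PatternResultCache:
    """LRU cache of archived-pattern lookups keyed by correlation attribute."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._cache = OrderedDict()

    def lookup(self, key, archive_query):
        if key in self._cache:
            self._cache.move_to_end(key)       # cache hit: refresh recency
            return self._cache[key]
        result = archive_query(key)            # miss: query the archive once
        self._cache[key] = result
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)    # evict least recently used
        return result
```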

    An open electronic marketplace through agent-based workflows: MOPPET

    We propose an electronic marketplace architecture, called MOPPET, in which the commerce processes in the marketplace are modeled as adaptable agent-based workflows. The higher level of abstraction provided by workflow technology makes it possible to customize electronic commerce processes for different users. Agent-based implementation, on the other hand, provides a highly reusable component-based workflow architecture as well as negotiation ability and the capability to adapt to dynamic changes in the environment. Agent communication is handled through the Knowledge Query and Manipulation Language (KQML). A workflow-based architecture also allows complete modeling of electronic commerce processes: involved parties can invoke existing applications, define new tasks, and restructure the control and data flow among tasks to create custom-built process definitions. In the proposed architecture, all data exchanges are realized through the Extensible Markup Language (XML), providing uniformity, simplicity, and a highly open and interoperable architecture. Metadata of activities are expressed through the Resource Description Framework (RDF). The Common Business Library (CBL) is used to achieve interoperability across business domains, and domain-specific Document Type Definitions (DTDs) are used for vertical industries. We provide our own specifications for missing DTDs, to be replaced by the original specifications when they become available.
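    For flavor, a KQML performative for a simple price query between agents might look like the following; the agent names, ontology, and message content are hypothetical, not MOPPET's actual message vocabulary.

```python
# Schematic KQML ask-one performative, built as a plain string.
# All identifiers below are invented for illustration.
ask_price = """
(ask-one
  :sender   buyer-agent
  :receiver catalog-agent
  :language KIF
  :ontology e-commerce
  :content  (price item-42 ?p))
"""
print(ask_price)
```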

    Greenhouse: A Zero-Positive Machine Learning System for Time-Series Anomaly Detection

    This short paper describes our ongoing research on Greenhouse, a zero-positive machine learning system for time-series anomaly detection.
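    "Zero-positive" means the system trains only on data assumed to be anomaly-free and flags deviations at inference time. The sketch below substitutes a trivial mean/variance forecaster for Greenhouse's learned model, purely to show the zero-positive training setup; the function and parameter names are assumptions.

```python
import numpy as np

def zero_positive_detector(train_normal, threshold_sigma=3.0):
    """Fit only on data assumed normal (no labeled anomalies), then flag
    points whose deviation from the learned 'normal' is unusually large.
    A stand-in for Greenhouse's learned forecaster."""
    mu, sigma = train_normal.mean(), train_normal.std()

    def score(x):
        # Deviation from the model of normal; beyond threshold => anomaly.
        return np.abs(x - mu) / (sigma + 1e-9) > threshold_sigma

    return score

# Usage: fit on a window believed anomaly-free, then score new points.
detect = zero_positive_detector(np.random.normal(0.0, 1.0, size=1000))
print(detect(np.array([0.5, 8.0])))  # -> [False  True]
```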

    Precision and Recall for Range-Based Anomaly Detection

    Classical anomaly detection is principally concerned with point-based anomalies, anomalies that occur at a single data point. In this paper, we present a new mathematical model to express range-based anomalies, anomalies that occur over a range (or period) of time.

    Neo: A Learned Query Optimizer

    Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query execution plans. Neo bootstraps its query optimization model from existing optimizers and continues to learn from incoming queries, building upon its successes and learning from its failures. Furthermore, Neo naturally adapts to underlying data patterns and is robust to estimation errors. Experimental results demonstrate that Neo, even when bootstrapped from a simple optimizer like PostgreSQL, can learn a model that offers similar performance to state-of-the-art commercial optimizers, and in some cases even surpasses them.
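    Schematically, a learned value model guiding a best-first search over partial query plans can be sketched as below; `expand`, `value_net`, and `is_complete` are assumed interfaces for illustration, not Neo's actual components.

```python
import heapq

def learned_plan_search(initial_partial_plan, expand, value_net):
    """Best-first search over partial plans, ordered by a learned value
    model that predicts the latency of the best complete plan reachable
    from each partial plan."""
    frontier = [(value_net(initial_partial_plan), 0, initial_partial_plan)]
    tiebreak = 1  # keeps heap entries comparable when scores tie
    while frontier:
        _, _, plan = heapq.heappop(frontier)
        if plan.is_complete():
            return plan                       # lowest predicted latency wins
        for child in expand(plan):            # add one join/operator decision
            heapq.heappush(frontier, (value_net(child), tiebreak, child))
            tiebreak += 1
    return None
```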

    Precision and Recall for Time Series

    Classical anomaly detection is principally concerned with point-based anomalies, those anomalies that occur at a single point in time. Yet, many real-world anomalies are range-based, meaning they occur over a period of time. Motivated by this observation, we present a new mathematical model to evaluate the accuracy of time series classification algorithms. Our model expands the well-known Precision and Recall metrics to measure ranges, while simultaneously enabling customization for domain-specific preferences.
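    A heavily simplified version of the range-based recall idea: each real anomaly range earns partial credit for being detected at all (existence) plus credit for how much of it is covered (overlap). The full model also includes positional bias and cardinality factors, which this sketch omits; `alpha` stands in for the model's existence/overlap weighting.

```python
def range_recall(real_ranges, predicted_ranges, alpha=0.5):
    """Simplified range-based recall over inclusive integer ranges (start, end)."""
    def covered_fraction(start, end):
        length = end - start + 1
        # Sum of overlaps with every predicted range (sketch: may double
        # count overlapping predictions, so we cap at the range length).
        covered = sum(max(0, min(end, pe) - max(start, ps) + 1)
                      for ps, pe in predicted_ranges)
        return min(covered, length) / length

    if not real_ranges:
        return 0.0
    total = 0.0
    for start, end in real_ranges:
        frac = covered_fraction(start, end)
        existence = 1.0 if frac > 0 else 0.0
        total += alpha * existence + (1 - alpha) * frac
    return total / len(real_ranges)

# A real anomaly over [3, 8] detected only during [5, 6]:
print(range_recall([(3, 8)], [(5, 6)]))  # 0.5 existence + partial overlap
```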

    Bao: Learning to Steer Query Optimizers

    Query optimization remains one of the most challenging problems in data management systems. Recent efforts to apply machine learning techniques to query optimization challenges have been promising, but have shown few practical gains due to substantial training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties and drawing upon a long history of research in multi-armed bandits, we introduce Bao (the BAndit Optimizer). Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints. Bao combines modern tree convolutional neural networks with Thompson sampling, a decades-old and well-studied reinforcement learning algorithm. As a result, Bao automatically learns from its mistakes and adapts to changes in query workloads, data, and schema. Experimentally, we demonstrate that Bao can quickly (an order of magnitude faster than previous approaches) learn strategies that improve end-to-end query execution performance, including tail latency. In cloud environments, we show that Bao can offer both reduced costs and better performance compared with a sophisticated commercial system.
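    The bandit core can be sketched with classic Beta-Bernoulli Thompson sampling over hint sets (the "arms"). This simplifies Bao's reward model, which in the paper is a tree convolutional neural network over plans, down to win/loss counts purely for illustration; the hint-set names below are invented.

```python
import random

def thompson_select(arm_stats):
    """Pick a hint set by sampling each arm's Beta posterior over its
    probability of beating the default plan, then taking the best draw."""
    best_arm, best_draw = None, -1.0
    for arm, (wins, losses) in arm_stats.items():
        draw = random.betavariate(wins + 1, losses + 1)  # Beta(1,1) prior
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm

# Example: three hint sets with observed win/loss counts vs. the default plan.
stats = {"no_nested_loop": (8, 2), "no_hash_join": (3, 5), "default": (5, 5)}
print(thompson_select(stats))
```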

    Modeling the execution semantics of stream processing engines with SECRET

    There are many academic and commercial stream processing engines (SPEs) today, each of them with its own execution semantics. This variation may lead to seemingly inexplicable differences in query results. In this paper, we present SECRET, a model of the behavior of SPEs. SECRET is a descriptive model that allows users to analyze the behavior of systems and understand the results of window-based queries (with time- and tuple-based windows) for a broad range of heterogeneous SPEs. The model is the result of extensive analysis and experimentation with several commercial and academic engines. In the paper, we describe the types of heterogeneity found in existing engines and show with experiments on real systems that our model can explain the key differences in windowing behavior.
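    To give a flavor of what such a model pins down, the sketch below computes the extent of the latest time-based sliding window open at a given application time. This is one simplified reading of a window-scoping function, not SECRET's exact definition, and it assumes the slide is no larger than the window size.

```python
def window_scope(t, size, slide, t0=0):
    """Extent [start, start + size) of the most recent sliding window whose
    left edge is at or before application time t (all in the same time unit).
    Simplified illustration; assumes slide <= size so t falls in the window."""
    start = t0 + ((t - t0) // slide) * slide
    return (start, start + size)

# A 10-unit window sliding by 5: at t=17 the latest open window is [15, 25).
print(window_scope(17, size=10, slide=5))
```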